Abstract

We present a supervised binary encoding scheme for image
retrieval that learns projections by taking into account the
similarity between classes obtained from output embeddings.
Our motivation is that binary hash codes learned in this way
improve both the visual quality of retrieval results and
existing supervised hashing schemes. We employ a sequential
greedy optimization that learns relationship-aware projections
by minimizing the difference between inner products of binary
codes and output embedding vectors. We develop a joint
optimization framework to learn projections which improve the
accuracy of supervised hashing over the current state of the
art with respect to standard and sibling evaluation metrics.
We further boost performance by applying a supervised
dimensionality reduction technique to kernelized input CNN
features. Experiments are performed on three datasets:
CUB-2011, SUN-Attribute and ImageNet ILSVRC 2010. As a
by-product of our method, we show that a simple k-nn pooling
classifier with our discriminative codes improves over complex
classification models on fine-grained datasets like CUB and
offers a compression ratio of 1024 on CNN features.


1. Introduction 

Given a database of images, image retrieval is the problem
of returning images from the database that are most
similar to a query. Performing image retrieval on databases
with billions of images is challenging due to the linear time
complexity of nearest neighbor retrieval algorithms. Image
hashing [2, 12, 27, 16, 15, 19] addresses this problem by
obtaining similarity preserving binary codes which represent
high dimensional floating point image descriptors, and offer
efficient storage and scalable retrieval with sub-linear
search times. These binary hash-codes can be learned in
unsupervised or supervised settings. Unsupervised hashing
algorithms map nearby points in a metric space to similar
binary codes. Supervised hashing algorithms try to preserve
semantic label information in the Hamming space. Images
that belong to the same class are mapped to similar binary
codes.

Figure 1. We prefer results II over I because they tend to retrieve
images of classes (jaguar and tiger) related to the class label of the
query (leopard), rather than unrelated classes (sharks and dolphins).

In this work, we develop a new approach to supervised
hashing, which we motivate with the following example
(Figure 1). Consider an image retrieval problem that involves
a database of animals. A user provides the query
image of a leopard. Now consider the following three scenarios:

1. If the retrieval algorithm returns images of leopards, 
we can deem the result to be absolutely satisfactory. 

2. If the retrieval algorithm returns images of dolphins,
whales or sharks, we consider the results to be absolutely
unsatisfactory because not only are dolphins not
leopards, they do not look anything at all like leopards.

3. If the retrieval algorithm returns the image of a jaguar 
or a tiger, we would be reasonably satisfied with the 
results. Although a jaguar is not the same as a leopard, 
it does look remarkably similar to one - it is a large 
carnivorous cat with spots and a tawny yellow coat. 

In the above example, leopards, dolphins, whales,
sharks, jaguars and tigers all belong to different categories.
However, some of these categories are more closely related
to each other than to other categories. Animals which fall
under the “big cat” (Panthera) genus are related to each
other, as are the large aquatic vertebrates like dolphins,
sharks and whales. We designate the related categories as
“siblings”. To study the relationships between categories,
Weinberger et al. [25] suggested the concept of “output
embeddings” - vector representations of category information
in Euclidean space. There has been extensive work on “input
embeddings”, which are vector-space representations of
images [23, 22, 11], but less work has been done on output
embeddings, which map similar category labels to similar
vectors in Euclidean space. In an output embedding space
of animals, we would expect the embeddings of chimpanzees,
orangutans and gorillas to be near each other, as would
those of leopards, cheetahs, tigers and jaguars.

Our method, which uses output embeddings to construct
hash codes in a supervised framework, is called SHOE:
Supervised Hashing with Output Embeddings (Figure 2). Our
motivation for doing this is two-fold. First, it is our belief
(validated experimentally) that we can construct better
binary codes for a particular class given knowledge of its
sibling classes. Second, if our algorithm is unable to
retrieve images of the same class, it will try to retrieve images
of sibling classes, rather than images of unrelated classes.
The assumption here is that if images of the same class as
the query cannot be retrieved, images of sibling classes are
more useful to the user than images of unrelated classes.
We perform extensive retrieval experiments on the
Caltech-UCSD Birds (CUB) dataset [28], the SUN Attribute
Dataset [20] and ImageNet [4]. Our hash-codes can also be used
for classification, and we report accuracy on the CUB dataset
using a nearest-neighbor classifier which is better than
R-CNN and its variants [6, 31].

The contributions of our paper are as follows : 

1. To the best of our knowledge, our approach is the first 
to introduce the problem of learning supervised hash 
functions using output embeddings. 

2. We propose a joint learning method to solve the above
problem, and validate it with retrieval and classification
experiments.

3. We propose two new evaluation criteria - “sibling metrics”
and “weighted sibling metrics” - for gauging the
efficacy of our method.

4. We significantly boost retrieval and classification
performance by applying Canonical Correlation Analysis
[8] on supervised features, and learn hash functions using
output embeddings on these features.

The remainder of this paper is arranged as follows.
Section 2 describes related work. In Section 3 we describe our
hashing framework, and carry out experiments in Sections
4 and 5. We conclude in Section 6.

Figure 2. We perform k-means clustering on Word2Vec embeddings
[18] of ImageNet classes. The principle behind SHOE is that
images belonging to related classes (like leopard or tiger, which
are nearby in the output embedding space) are mapped to nearby
binary codes (represented by points on a binary hypercube).
Images belonging to unrelated classes (like leopards and aircraft) are
mapped to distant binary codes. This figure was created using [14]
and is for illustrative purposes.

2. Related Work 

Work on image hashing can be divided into unsupervised
and supervised methods. For brevity, we only consider
supervised methods. Supervised hashing algorithms are based
on the objective of minimizing the difference between Hamming
distances and the similarity of pairs of data points.
Supervised Hashing with Kernels (KSH) [16] uses class labels
to determine the similarity. Points are considered ‘similar’
(value ‘1’) if they belong to the same class and ‘dissimilar’
(value ‘-1’) otherwise. They utilize a simplified objective
function using the relation between the Hamming distance and
inner products of the binary codes. A sequential greedy
optimization is adapted to obtain supervised projections.
FastHash [15] also uses a KSH objective function but employs
decision trees as hash functions and utilizes a GraphCut based
method for binary code inference. Minimal loss hashing
[19] uses a structured SVM framework [30] to generate binary
codes with an online learning algorithm.

All these methods except KSH categorize the input pairs
to be either similar or dissimilar. KSH allows a similarity
of 0 for related pairs, but the authors only use it to define
metric neighbors and not for semantic neighbors. FastHash
entirely ignores the related pairs, as it weighs the KSH loss
function by the absolute value of the label. Also, their work
does not support a similarity value other than 1 and -1, as
it violates the submodularity property - a crucial property
required to solve the problem in parts. To the best of our
knowledge, we are the first to use similarity information
of sibling classes in a supervised hashing framework. Our
work is also different from others as we compute similarity
from output embeddings and use a joint learning framework
to learn the sibling similarity. We now discuss related work
about output embeddings.

Output embeddings can be defined as vector representations
of class labels. These are divided into two types:
data-independent and data-dependent. Some of the
data-independent embeddings include [9, 28, 13, 20, 24].
Langford et al. [9] constructed output embeddings randomly
from the rows of a Hadamard matrix where each embedding
was a random vector of 1 or -1. Embeddings constructed
from side information about classes, such as attributes
or a Linnaean hierarchy, are available with datasets
such as [28, 13, 20]. In WSABIE [29], the authors jointly
learn the input embeddings and the output embeddings to
maximize classification accuracy in a structural SVM setting.
Akata et al. [1] use the WSABIE framework to learn
fine-grained classification models by mapping the output
embeddings to attributes, taxonomies and their combination.

The binary hash-codes that our method, SHOE, learns
on the CUB and SUN datasets use attributes as output
embeddings, just like [1]. For ILSVRC2010 experiments, we
use taxonomy-derived embeddings similar to [24]. In a
taxonomy embedding, a binary output embedding vector is
obtained, where each node in the class hierarchy and its
ancestors are represented as 1 while non-ancestors are
represented as 0. Deng et al. [3] show that classification that
takes hierarchies into account can be informative. Mikolov
et al. [18] use a skip-gram architecture trained on a large
text corpus to learn output embeddings for words and short
phrases. These Word2Vec embeddings are used by [5] for
large scale image classification and zero-shot learning.
Finally, output embeddings can even be learned from the data.
For example, [17] exploits co-occurrences of visual concepts
to learn classifiers for unseen labels using known classifiers.
All these methods use output embeddings for classification
and zero-shot learning, but none have used them to learn
binary codes for retrieval.

3. Method 

3.1. Preliminaries 

Given a training set $M = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ of $N$
(image, label) pairs with $x_i \in X$ and $y_i \in Y$, let
$\phi : X \to \mathcal{X} \subseteq \mathbb{R}^d$ be the input embedding
function and $\psi : Y \to \mathcal{Y} \subseteq \mathbb{R}^e$ be the
output embedding function. We wish to learn binary codes $b_i, b_j$
of length $c$ (i.e., $b_i \in \{-1, 1\}^c$) such that
for pairs of training images, the Hamming distance between
the codes preserves the distance between their class labels
(given by their corresponding output embedding vectors).
In other words, for a given query image, retrieved results of
sibling (unrelated) classes ought to be ranked higher (lower).
To this end, we obtain the following objective function:

$$\min \; O(b) = \sum_{i=1}^{N} \sum_{j=1}^{N} \big( d_H(b_i, b_j) - d_E(\psi(y_i), \psi(y_j)) \big)^2 \qquad (1)$$

where $d_H(b_i, b_j)$ is the Hamming distance between binary
codes $b_i$ and $b_j$, and $d_E(\psi(y_i), \psi(y_j))$ is the Euclidean
distance between the output embedding vectors $\psi(y_i)$ and $\psi(y_j)$.

For an input image $x$ with input embedding vector $\phi(x)$,
we obtain a binary code $b$ of length $c$ bits. Each bit is
computed using a hash function $h_l(x)$ that takes the form:

$$h_l(x) = \mathrm{sgn}(w_l \, \phi(x)), \quad w_l \in \mathbb{R}^d. \qquad (2)$$

To learn $c$ such hash functions $H = \{h_l \mid l = 1, \ldots, c\}$, we
learn $c$ projection vectors $W = [w_1, w_2, \ldots, w_c]$, which we
compactly write as $H(x) = \mathrm{sgn}(W \phi(x))$, $W \in \mathbb{R}^{c \times d}$.
Without loss of generality, we can assume that $\phi(x)$ is a
mean-centered feature. This ensures we obtain compact
codes by satisfying the balanced property of hashing (i.e.,
each bit fires approximately 50% of the time) [27]. $\phi(x)$ can
be an input embedding that maps images to features in
either kernelized or unkernelized forms.

Solving the optimization problem in Equation (1) is
not straightforward, so we utilize the relation between the
inner product of binary codes and Hamming distances [26,
16], given as $2\, d_H(b_i, b_j) = c - b_i^T b_j$, where
$b_i^T b_j = \sum_{l=1}^{c} h_l(\phi(x_i)) \, h_l(\phi(x_j))$ is the inner
product of the binary codes $b_i$ and $b_j$. Note that the inner
product of binary codes lies between $-c$ and $+c$, while the Hamming
distance ranges from 0 to $c$, where the distance between the
nearest neighbors is 0 and between the farthest neighbors
is $c$. By unit normalizing the output embedding vectors,
$\|\psi(y)\| = 1$, we exploit the relationship between Euclidean
distances and the dot products of normalized vectors, given
as $d_E(\psi(y_i), \psi(y_j))^2 = 2 - 2\, \psi(y_i)^T \psi(y_j)$, and obtain the
following objective function:

$$\min \sum_{i=1}^{N} \sum_{j=1}^{N} \Big( \frac{1}{c} \sum_{l=1}^{c} h_l(\phi(x_i)) \, h_l(\phi(x_j)) - \psi(y_i)^T \psi(y_j) \Big)^2 \qquad (3)$$

Let $o_{ij} = \psi(y_i)^T \psi(y_j)$. As a consequence of the unit
normalization of $\psi(y_i)$, $-1 \le o_{ij} \le 1$, which implies that
the similarity between same classes is 1 and the similarity
between different classes is as low as -1. The objective
ensures that the learned binary codes preserve the similarity
between output embeddings, which is required for supervised
hashing and our goal of ranking related neighbors before
farthest neighbors.
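
The relation between Hamming distance and inner products, and the objective of Equation (3), can be checked numerically. The following is a minimal sketch, assuming NumPy, random toy embeddings and random (unlearned) projections; the function names and the tie-breaking rule for sgn(0) are illustrative choices, not part of the method.

```python
import numpy as np

def hash_codes(W, Phi):
    """Codes sgn(W phi(x)) in {-1, +1} for each row of Phi (N x d); W is c x d."""
    B = np.sign(Phi @ W.T)
    B[B == 0] = 1          # assumed tie-break so that every bit is +/-1
    return B

def hamming(B):
    """Pairwise Hamming distances between the rows of a +/-1 code matrix."""
    return (B[:, None, :] != B[None, :, :]).sum(axis=2)

def shoe_objective(B, Psi):
    """Equation (3): squared gap between (1/c) b_i^T b_j and o_ij = psi_i^T psi_j."""
    c = B.shape[1]
    O = Psi @ Psi.T
    return float(np.sum((B @ B.T / c - O) ** 2))

rng = np.random.default_rng(0)
N, d, e, c = 6, 10, 4, 8                      # toy sizes, chosen arbitrarily
Phi = rng.standard_normal((N, d))
Phi -= Phi.mean(axis=0)                       # mean-centred input embeddings
Psi = rng.standard_normal((N, e))
Psi /= np.linalg.norm(Psi, axis=1, keepdims=True)   # unit-norm output embeddings
W = rng.standard_normal((c, d))               # random projections, illustration only

B = hash_codes(W, Phi)
identity_holds = np.array_equal(2 * hamming(B), c - B @ B.T)
loss = shoe_objective(B, Psi)
```

Here `identity_holds` confirms $2\,d_H(b_i, b_j) = c - b_i^T b_j$ for every pair, and `loss` is the quantity a learned $W$ would try to reduce.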

This is similar to the KSH [16] objective function, except
that KSH assumes that $o_{ij}$ takes only values 1 (-1) for
similar (dissimilar) pairs defined with semantic information.
Their work also accommodates the definition $o_{ij} = 0$
for related pairs, but only for metric neighbors. Our work
differs from theirs in that we emphasize the learning of
binary codes that preserve the similarity between the classes,
with $o_{ij}$ capturing that similarity. Regardless of the
definition of $o_{ij}$, our optimization is similar,
and so we employ a similar sequential greedy optimization
for minimizing $O(b)$. We refer the reader to [16] for further
details.

Figure 3. Retrieval on CUB dataset comparing our method SHOE(E) with the state-of-the-art hashing techniques. The above plots report
precision, sibling precision and weighted sibling precision for top 5 sibling classes for bits c = {16, 32, 64, 128, 256}. The table reports
mAP, Sibling and Weighted Sibling mAP for c = 64 bits.

3.2. Evaluation Criteria 

Standard metrics like precision, recall and mAP defined
for semantic neighbors are not sufficient to evaluate the
retrieval of the sibling class images. To measure this, we
define sibling precision, sibling recall and sibling average
precision metrics. Let $R_y : (y_i, y_j) \to rank$ return the rank
of class $y_j$ for a query class $y_i$, $0 \le rank < L$, where
$L$ is the number of classes. Note that $R_y(y, y) = 0$ for
the same class. The ranking $R_y$ is computed by sorting the
distance between the output embedding vectors $\psi(y)$ and
$\psi(y^*)$, $y^* \in Y \setminus y$. We obtain the weight of the sibling
class used for evaluation using the following functions for the
Sibling ($Sib_m$) and Weighted Sibling ($Sib^w_m$) metrics:

$$Sib_m : (y_i, y_j, R_y) \to I(R_y(y_i, y_j) \le m)$$

$$Sib^w_m : (y_i, y_j, R_y) \to \frac{m - R_y(y_i, y_j)}{m} * I(rank \le m)$$

where $m$ is the number of related classes for each query
class and $I(\cdot)$ is the indicator function that returns 0 when
$rank > m$. The Sibling and Weighted Sibling precision@k,
recall@k and mAP are defined as:

In the above equations, $y_{q_l}$ refers to the class of the $l$-th
retrieved image.
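
To make the sibling weights concrete, here is a small sketch in plain Python. It ranks classes by Euclidean distance between output embeddings, then scores a retrieved list. Two parts are assumptions rather than definitions taken from the text: the weighted variant is taken to decay linearly with rank, and sibling precision@k is taken to be the mean weight over the top k results. The toy embeddings and all function names are hypothetical.

```python
import math

def class_ranks(psi):
    """ranks[i][j]: rank of class j for query class i (0 for i itself)."""
    L = len(psi)
    ranks = []
    for i in range(L):
        others = sorted((j for j in range(L) if j != i),
                        key=lambda j: math.dist(psi[i], psi[j]))
        row = [0] * L
        for position, j in enumerate(others, start=1):
            row[j] = position
        ranks.append(row)
    return ranks

def sib(rank, m):
    """Unweighted sibling weight: 1 within the top-m related classes, else 0."""
    return 1.0 if rank <= m else 0.0

def sib_weighted(rank, m):
    """Assumed weighted variant: linear decay with rank, 0 beyond rank m."""
    return (m - rank) / m if rank <= m else 0.0

def sibling_precision_at_k(query_class, retrieved_classes, ranks, m, k, weight=sib):
    """Assumed form: mean sibling weight over the top-k retrieved images."""
    top = retrieved_classes[:k]
    return sum(weight(ranks[query_class][y], m) for y in top) / k

psi = [[0.0], [1.0], [2.5], [10.0]]   # toy 1-D output embeddings for 4 classes
ranks = class_ranks(psi)
retrieved = [0, 1, 3, 3]              # classes of the images returned for a class-0 query
p_sib = sibling_precision_at_k(0, retrieved, ranks, m=1, k=4)
p_wsib = sibling_precision_at_k(0, retrieved, ranks, m=2, k=4, weight=sib_weighted)
```

With these toy values, two of the four results count as same-class or sibling hits for `m=1`, and the weighted score discounts the sibling hit relative to the exact hit.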


3.3. Preliminary Experiments 

We evaluate the proposed hashing scheme that takes into 
account structure of the related classes with the following 
performance metrics: Precision@k, mAP and their sibling 
versions previously defined. 

• Datasets: To test our approach, we use datasets that
contain information about class structure. The CUB-2011
dataset [28] contains 200 fine-grained bird categories
with 312 attributes and is well suited for our purpose.
In this dataset, although both binary and continuous
real-valued attributes are available, we use only
the mean-centered real-valued attributes as output
embeddings $\psi(y)$. We obtain the ranking $R_y$ for each class $y$
based on these attribute embeddings. There are 5994
train and 5794 test images in the dataset. We select
a subset of the dataset of size 2000 for training, use
the whole train set for retrieval and all test images as
queries. Ground truth for a query is defined label-wise
and each query class has approximately 30 same class
neighbors in the retrieval set. For input embeddings,
we extract state-of-the-art 4096-dimensional Convolutional
Neural Network (CNN) features from the fc7 layer
for each image using the Caffe Deep Learning library
developed by [10]. We kernelize the CNN features,
which take the form $\sum_{i=1}^{p} \kappa(x, x_i)$, where $\kappa$ is a
radial basis kernel and $p$ is the cardinality of a subset of
training samples, designated as “anchor points”. We refer
to these as CNN+K features and they are inherently
mean-centered.

• Comparison methods: We compare our method with
raw output embeddings, SHOE(E), against the following
supervised and unsupervised hashing schemes:
KSH [16], FastHash [15], ITQ [7] and LSH [2]. We use
their publicly available implementations and set the
parameters to obtain the best performance. It is important
to note that none of these methods utilize the
distribution of class labels in the output embedding
space. The closest comparison would be to use KSH,
setting the similarity of related class pairs to 0.
For this purpose, we obtain the top-m related pairs
for each class using $R_y$ and set the similarity to 0 for
related pairs. To evaluate the unsupervised ITQ and
LSH, we zero-center the data and apply PCA to learn
the projections. We use CNN+K features with p = 300
for evaluating SHOE(E) and KSH since we learn linear
projections in these methods, unlike FastHash which
learns non-linear decision trees on linear features.

• Results: Figure 3 shows the precision@30, sibling
and weighted sibling precision@30 plots for 5 related
classes, i.e. m = 6 (+1 for the same class), obtained by
encoding the input embeddings to bits of length c =
{16, 32, 64, 128, 256}. The table in Figure 3 shows the
recall and mAP and its sibling variants for 64 bits. We
observe that SHOE(E) does better than the baselines for
both sibling and weighted sibling precision metrics for
the top-30 retrieved neighbors for all bit lengths, but there
is a loss in precision compared to the KSH method.
In their paper, FastHash [15] shows better performance
compared to KSH. However, it does not perform well
here because of the large number of classes and few
training samples available per class.
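
The CNN+K features from the Datasets item above can be sketched as follows. This is a minimal illustration assuming NumPy, random stand-in features, and an RBF bandwidth `gamma` picked arbitrarily; it also assumes that mean-centering is done by subtracting each anchor's mean kernel response over the training set, in the style of kernel-based hashing. None of these specifics are stated in the text.

```python
import numpy as np

def rbf_kernel(X, anchors, gamma):
    """kappa(x, a) = exp(-gamma * ||x - a||^2) for every row of X and every anchor."""
    sq_dist = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)

rng = np.random.default_rng(0)
train = rng.standard_normal((50, 16))     # stand-in for 4096-D CNN features
p = 5                                     # number of anchor points (300 in the experiments)
anchors = train[rng.choice(len(train), size=p, replace=False)]
gamma = 1.0 / train.shape[1]              # assumed bandwidth

K_train = rbf_kernel(train, anchors, gamma)
mu = K_train.mean(axis=0)                 # per-anchor mean over the training set
features_train = K_train - mu             # zero-mean on the training set

# Queries reuse the training anchors and the training mean.
query = rng.standard_normal((3, 16))
features_query = rbf_kernel(query, anchors, gamma) - mu
```

Each image is thus described by `p` kernel responses, so a linear projection on these features acts as a non-linear function of the original CNN features.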

3.4. Analysis

Figure 4. Retrieval on CUB dataset evaluating the performance
of our method (SHOE) for varying $\theta$ values and p = 1000. The
left and right y-axis show the standard metrics and sibling metrics
respectively.

Contrary to our expectation that related class information
improves both standard and sibling performance metrics,
experiments in Section 3.3 show that using similarity directly
from output embeddings actually reduces the performance
for same classes, while improving it for sibling classes. To
analyse this, we obtained the top-m related classes using
ranking $R_y$ and assigned a constant value $o_{ij} = \theta$ (previously,
in Equation (3), we had defined $o_{ij} = \psi(y_i)^T \psi(y_j)$). $\theta$
measures the similarity between a class and a related class.
Figure 4 shows the performance of our method (SHOE) for
varying $\theta$ values for 64 bits on the CUB-2011 dataset.
Results reveal that, when a fixed similarity is used, we
actually gain performance from the sibling class training
examples, and this gain is maximized for negative values of $\theta$,
i.e. $-1 < \theta < 0$. The intuition behind this is: when $\theta$
is close to 1, the learned hash-code would not discriminate
well enough between identical classes and sibling classes.
For instance, in our “database of animals” example, we
would learn hash-codes that nearly equate leopards with
jaguars, which is not what we desire. On the other hand,
when $\theta$ is close to -1, a hash-code for a leopard image will
be learned mostly from other training images of leopards,
but with slight consideration towards training images of its
sibling classes. When $\theta$ is assigned -1, the sibling classes
aren’t considered at all, so our method becomes identical to
KSH [16]. We are now interested in learning $\theta$ simultaneously
with the hash functions during the training phase.
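
The constant-similarity setup used in this analysis can be sketched as a target matrix with entries 1, $\theta$ or -1. The snippet below is a plain-Python illustration; the rank table, labels and function name are hypothetical, and the convention that ranks 1 to m count as siblings is an assumption.

```python
def similarity_targets(labels, ranks, m, theta):
    """Target o_ij: 1 for same class, theta for top-m sibling classes, -1 otherwise."""
    targets = []
    for yi in labels:
        row = []
        for yj in labels:
            r = ranks[yi][yj]        # rank of class yj for query class yi
            if r == 0:
                row.append(1.0)      # identical classes
            elif r <= m:
                row.append(theta)    # sibling classes
            else:
                row.append(-1.0)     # unrelated classes
        targets.append(row)
    return targets

# Hypothetical rank table for 3 classes (row = query class, column = other class).
ranks = [[0, 1, 2],
         [1, 0, 2],
         [2, 1, 0]]
labels = [0, 0, 1, 2]                # class label of each training image
S = similarity_targets(labels, ranks, m=1, theta=-0.5)
S_ksh = similarity_targets(labels, ranks, m=1, theta=-1.0)
```

Because the ranking is defined per query class, the resulting matrix need not be symmetric (class 1 is a sibling of class 2 here, but not the reverse). Setting `theta=-1.0` collapses the targets to the two-valued similar/dissimilar case.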

3.5. SHOE Revisited 

We observe that the objective function that we want to
minimize in Equation (3) can be split into three parts, for
identical classes, sibling classes and unrelated classes,
respectively. We also observe from the preceding analysis that
precision and recall metrics improve for negative values of
$\theta$. Therefore, we add a regularizer term $\lambda \|\theta + 1\|^2$ to the
objective function, which becomes small when $\theta$ lies close to
-1. For easier notation, we denote $h_{li} = h_l(\phi(x_i))$. Our
modified objective function now becomes:

$$\min \sum_{i,j \in \text{same class}} \Big( \frac{1}{c} \sum_{l=1}^{c} h_{li} h_{lj} - 1 \Big)^2 + \sum_{i,j \in \text{sibling class}} \Big( \frac{1}{c} \sum_{l=1}^{c} h_{li} h_{lj} - \theta \Big)^2 + \sum_{i,j \in \text{unrelated class}} \Big( \frac{1}{c} \sum_{l=1}^{c} h_{li} h_{lj} + 1 \Big)^2 + \lambda \|\theta + 1\|^2 \qquad (8)$$

Let $R^c_{ij} = \sum_{l=1}^{c} h_{li} h_{lj}$ denote the inner
product of the binary codes $b_i$ and $b_j$ of length $c$. We now
compute the derivative of Equation (8) w.r.t. $\theta$, set it to 0 and
solve for $\theta$. We obtain:

$$\theta = \theta_c = \frac{\sum_{i,j \in \text{sibling class}} R^c_{ij} - c \lambda}{c \, (n_{sib} + \lambda)} \qquad (9)$$

where $n_{sib}$ is the number of sibling pairs in the training data.
We have thus obtained a closed form solution for the optimal
$\theta$. However, we cannot calculate $\theta$ directly as we do not
learn all the bits at once. Therefore, we employ a two-step
alternating optimization procedure that first learns the bits and
then an approximate $\theta_l$ value calculated from the previously
learned bits. For the first iteration, we use an initial value
$\theta_0$ that satisfies the constraint $-1 < \theta_0 < 0$. The two-step
optimization procedure for learning the $l$-th hash function is:

1. Step 1: We optimize for Equation (3), keeping $\theta_{l-1}$
constant and updating the projection vector W, thus
learning hash-code bits $h_l(\phi(x_i))$.

2. Step 2: We keep the hash-code bits $h_l(\phi(x_i))$ constant
and solve for $\theta_l$ using Equation (9).

Figure 5. Retrieval on CUB-2011 (first and second column) and SUN (third and fourth column) dataset comparing our methods SHOE(E)
and SHOE(L) with the state-of-the-art hashing techniques. The above plots report mAP, Sibling and Weighted Sibling mAP for top 5
sibling classes.
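
Step 2 amounts to evaluating a closed-form expression. The sketch below, in plain Python, assumes that the $\theta$-dependent part of the objective is the sibling term plus the regularizer $\lambda(\theta + 1)^2$ and derives the update by setting its derivative to zero; the codes, the sibling pair list and the value of `lam` are toy values, and the function names are hypothetical.

```python
def inner(bi, bj):
    """Inner product of two +/-1 codes."""
    return sum(a * b for a, b in zip(bi, bj))

def theta_update(codes, sibling_pairs, lam):
    """Closed-form minimiser of the sibling term plus lam * (theta + 1)^2."""
    c = len(codes[0])
    total = sum(inner(codes[i], codes[j]) for i, j in sibling_pairs)
    return (total - c * lam) / (c * (len(sibling_pairs) + lam))

def theta_cost(theta, codes, sibling_pairs, lam):
    """theta-dependent part of the objective, used here only as a sanity check."""
    c = len(codes[0])
    fit = sum((inner(codes[i], codes[j]) / c - theta) ** 2 for i, j in sibling_pairs)
    return fit + lam * (theta + 1) ** 2

codes = [[1, 1, 1, -1],
         [1, 1, -1, -1],
         [1, -1, -1, -1],
         [-1, -1, -1, 1]]                        # four toy 4-bit codes
sibling_pairs = [(0, 1), (1, 0), (1, 2), (2, 1)]  # ordered sibling pairs (i, j)
lam = 4.0
theta = theta_update(codes, sibling_pairs, lam)
```

In the alternating scheme, this update would be applied with the bits learned so far, and the resulting value would serve as the sibling target when learning the next bit.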

3.6. Supervised Dimensionality Reduction 

As our datasets contain class label information and
corresponding output embeddings, we have explored the idea
of supervised dimensionality reduction for input embeddings
$\phi(x) \in \mathbb{R}^d$ to $\omega(x) \in \mathbb{R}^c$ ($c < d$), given the output
embeddings $\psi(y) \in \mathbb{R}^e$. There are many supervised
dimensionality reduction techniques available in the literature,
such as Canonical Correlation Analysis (CCA) [8] and Partial
Least Squares Regression [21]. In particular,
we have used CCA [8] ($\phi \to \omega$) to extract a common latent
space from two views such that the correlation between the
views is maximized. [7] also leveraged the label information using
CCA to obtain supervised features prior to binary encoding.
However, they limit their output embeddings to take the
form of one-vs-remainder embeddings: $\psi(y) \in \{0, 1\}^L$ is
an $L$-dimensional binary vector with exactly one bit set to 1,
i.e. $\psi(y)_y = 1$, where $L$ is the number of class labels. On
the other hand, we apply CCA to the general form of output
embeddings that are real-valued continuous attributes
capturing structure between the classes. We observe that when
supervised features with CCA projections are used, we obtain
a significant boost in performance (≈ 100% improvement)
for all of our evaluation metrics.
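
A CCA projection of this kind can be sketched with NumPy alone. The version below is a generic regularised CCA (whitening both views, then an SVD of the cross-covariance), offered as an illustration under stated assumptions rather than the implementation used in the experiments: the ridge term `reg`, the toy data and the function names are all choices made here. Each row of `Y` is assumed to hold the output embedding of the corresponding image's class.

```python
import numpy as np

def _inv_sqrt(C):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs / np.sqrt(vals)) @ vecs.T

def fit_cca(X, Y, n_components, reg=1e-3):
    """Return the X-mean, the X-side projection and the canonical correlations."""
    mean_x = X.mean(axis=0)
    Xc, Yc = X - mean_x, Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])   # ridge keeps the inverse stable
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Wx = _inv_sqrt(Cxx)
    Wy = _inv_sqrt(Cyy)
    U, s, _ = np.linalg.svd(Wx @ Cxy @ Wy, full_matrices=False)
    A = Wx @ U[:, :n_components]                      # maps phi(x) to omega(x)
    return mean_x, A, s[:n_components]

def project(X, mean_x, A):
    """Apply the learned projection to (possibly new) input embeddings."""
    return (X - mean_x) @ A

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))                    # toy input embeddings
Y = X[:, :3] + 0.1 * rng.standard_normal((200, 3))    # toy correlated output embeddings
mean_x, A, corr = fit_cca(X, Y, n_components=2)
omega = project(X, mean_x, A)
```

In this formulation the number of components cannot exceed the smaller of the two embedding dimensions, which is worth keeping in mind when choosing the reduced dimension.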

4. Experiments 

We evaluate our method on the following datasets:
the Caltech-UCSD Birds (CUB) Dataset [28], the SUN Attribute
Dataset [20] and the ImageNet ILSVRC2010 dataset [4].
We extract CNN features from [10], as mentioned in
Section 3.3. In the case of the CUB dataset, we extract CNN
features for the bounding boxes that accompany the images.
For the CUB and SUN datasets, we create two training sets of
different sizes to examine the variation in performance with
the number of training examples. For all the datasets, we define
the ground truth using class labels.

Datasets: We have described the CUB dataset in Section 3.
The ImageNet ILSVRC 2010 dataset is a subset of
ImageNet and contains about 1.2 million images distributed
amongst 1000 classes. We uniformly select 2 images per
class as a test set and use the rest as the retrieval set. We
select 5000 training and p = 3000 anchor point images by
uniformly sampling across all classes. We obtain the output
embeddings for the ImageNet classes using the method of
Tsochantaridis [24]. Each of the 74401 synsets in ImageNet
is a node in a hierarchy graph and using this graph, we obtain
the ancestors for each class in the ILSVRC 2010 dataset.
We then construct a matrix $O_{ImageNet} \in \{0, 1\}^{1000 \times 74401}$,
where the $j$-th column of the $i$-th row is set to 1 if the $j$-th
class is an ancestor of the $i$-th class, 0 otherwise. Thus, the
output embedding of each class is represented by a row of
$O_{ImageNet}$.

                        CUB-2011 (5000 train)                        SUN Attribute (7000 train)
                pre            Sib            Sib^w           pre            Sib            Sib^w
Method          @30    mAP     @30    mAP     @30    mAP      @10    mAP     @10    mAP     @10    mAP
SHOE(L)+CCA     0.481  0.527   0.701  0.529   0.617  0.436    0.169  0.201   0.344  0.239   0.269  0.193
SHOE(E)+CCA     0.387  0.429   0.668  0.533   0.554  0.429    0.112  0.134   0.270  0.205   0.199  0.152
KSH+CCA         0.488  0.526   0.661  0.290   0.595  0.303    0.191  0.220   0.300  0.130   0.252  0.126
ITQ+CCA         0.242  0.256   0.339  0.125   0.299  0.129    0.062  0.070   0.108  0.044   0.088  0.041
FastHash        0.240  0.246   0.341  0.120   0.298  0.124    0.021  0.021   0.038  0.017   0.030  0.014
LSH             0.024  0.017   0.082  0.049   0.054  0.032    0.010  0.009   0.034  0.018   0.022  0.012

Table 1. Comparing Precision, mAP and their sibling variants with our methods (SHOE(L) and SHOE(E)) and several baselines for 128
bits. We apply CCA projections for all the methods except FastHash and LSH as their performance decreases. For sibling metrics, SHOE
performs significantly better than the baselines and performs as well as the baselines for standard metrics.
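
The construction of the taxonomy embedding matrix can be illustrated on a toy hierarchy. The sketch below is plain Python; the hierarchy, the class names and the `include_self` switch are hypothetical. The switch exists because the taxonomy embedding is described elsewhere in the text as marking each node together with its ancestors, while the matrix definition above mentions ancestors only.

```python
def ancestors(node, parents):
    """All ancestors of a node in a hierarchy given as node -> list of parents."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen

def taxonomy_embeddings(classes, parents, include_self=True):
    """Binary matrix with one row per class and one column per hierarchy node."""
    nodes = sorted(set(parents) | {p for ps in parents.values() for p in ps})
    column = {n: j for j, n in enumerate(nodes)}
    O = [[0] * len(nodes) for _ in classes]
    for i, cls in enumerate(classes):
        for a in ancestors(cls, parents):
            O[i][column[a]] = 1
        if include_self:
            O[i][column[cls]] = 1
    return O, nodes

# Hypothetical toy hierarchy, not taken from ImageNet.
parents = {
    "leopard": ["big_cat"], "tiger": ["big_cat"], "dolphin": ["aquatic"],
    "big_cat": ["animal"], "aquatic": ["animal"], "animal": [],
}
classes = ["leopard", "tiger", "dolphin"]
O, nodes = taxonomy_embeddings(classes, parents)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

shared_leopard_tiger = dot(O[0], O[1])     # shared ancestors: animal, big_cat
shared_leopard_dolphin = dot(O[0], O[2])   # shared ancestors: animal
```

Classes that share more ancestors have larger inner products, which is how the hierarchy ends up expressing sibling relationships.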

The SUN Attribute dataset [20] contains 14340 images
equally distributed amongst 717 classes, accompanied by
annotations of 102 real-valued attributes. We partition the
dataset into equal retrieval and test sets, each containing
7170 images. We derive two variants from the retrieval
set - the first has 3585 (5 per class) training and 1434 (2 per
class) anchor point images, while the second has 7000 (10
per class) training and 3000 (4 per class) anchor point images.
In this dataset, each test query has 10 same class
neighbors and 50 sibling class neighbors in the retrieval set.
We compute a per class embedding by averaging the embedding
vectors for each image in the class.

Evaluation Protocol: For binary hash-codes of length
c = {16, 32, 64, 128, 256}, we evaluate SHOE using standard,
sibling and weighted sibling flavors of precision, recall
and mAP. Two variants of SHOE are used - SHOE(E),
which uses raw output embeddings, and SHOE(L), which
is learned using the method in Section 3.5. For the smaller
CUB and SUN subsets, we compute mAP and their sibling
versions. For the big subsets, in addition to mAP, we also
compute precision. In the case of ImageNet, we compute
precision@50, recall@10K, mAP and their sibling variants.

Results: We compare our method to the well-known
image-hashing methods mentioned in
Section 3.3 and present the results in Figure 5 and Table 1.
We use CNN+K features as input embeddings for SHOE
and KSH. For the other methods, we use only CNN features.
The rows represent weighted sibling, sibling and mAP metrics.
The first two columns represent experiments without
and with CCA projections on the CUB dataset, while
the latter two columns represent the same on the SUN
Attribute dataset. From these plots, we see that our method
SHOE (red and green), without CCA projections, comfortably
outperforms the best baseline, KSH (blue), on all metrics
for both datasets. Using the CCA projections, we
significantly improve the performance (by ≈ 100%) for SHOE,
KSH and ITQ, while it lowers the performance of FastHash
and LSH. Our method with CCA projections performs as
well as the KSH baseline on the mAP metric and outperforms
it on the sibling and weighted sibling metrics. The gap in
performance between SHOE and KSH reduces when CCA
projections are applied on the standard metrics, but not on
the sibling metrics. On ImageNet, we only show results
with CCA projected features. We observe a similar trend,
with SHOE surpassing KSH on recall@10K and sibling
precision metrics, while performing as well as KSH on
precision@50 and mAP.

5. Fine-grained Category Classification 

In this section, we demonstrate the effectiveness of our
proposed codes for fine-grained classification of bird
categories in the CUB-2011 dataset. We propose a simple nearest
neighbor pooling classifier that classifies a given test image
by assigning it to the most common label among the top-$k$
retrieved images. Let $R_x(q, x) \to rank$ give the ranking of
the images retrieved based on our binary codes. Thus, rank
is 1 for the nearest neighbor and rank is $N$ for the farthest
neighbor, where $N$ is the size of the database. Given such a
ranking, with $\mathcal{M}_q$ denoting the top-$k$ ranked neighbors of a
new query $q$, we define the k-nn pooling classifier as:

$$class_{predict}(q) = \arg\max_{y} \sum_{x \in \mathcal{M}_q} I(class(x) == y) \qquad (10)$$
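
The k-nn pooling classifier of Equation (10) can be sketched in a few lines of plain Python: rank the database by Hamming distance to the query code, then take a majority vote over the top k labels. The toy codes, labels and function names are hypothetical, and ties in distance or in votes are resolved by database order, which is an arbitrary choice made here.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_pooling_predict(query_code, db_codes, db_labels, k):
    """Most common label among the k database codes nearest in Hamming distance."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    votes = Counter(db_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

db_codes = [[1, 1, 1, 1],
            [1, 1, 1, -1],
            [-1, -1, -1, -1],
            [1, 1, -1, -1],
            [-1, -1, -1, 1]]
db_labels = ["leopard", "leopard", "dolphin", "jaguar", "dolphin"]

pred_a = knn_pooling_predict([1, 1, 1, 1], db_codes, db_labels, k=3)
pred_b = knn_pooling_predict([-1, -1, -1, -1], db_codes, db_labels, k=3)
```

The classifier has no trained parameters of its own; all of the discriminative work is done by the codes.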

We use the above model to obtain the classification
accuracy on the CUB dataset with 200 categories from the
top-10 neighbors. In particular, we obtain the following
accuracies: top-1 accuracy (top-1) measures if the predicted
class matches the ground truth class, top-5 accuracy (top-5)
measures if one of the top-5 predicted classes matches the
ground truth class and sibling accuracy (sib) measures if the
predicted class is one of the sibling classes of the ground
truth class. As a baseline, we train a linear SVM model on
the CNN features. We compare our proposed binary codes
SHOE(L), KSH and state-of-the-art fine-grained classification
models that use CNN features. For this experiment,
we use bounding box information, but do not use any
part-based information available with the datasets. Hence, we
do a fair comparison between methods with no part-based
information. Table 2 shows the classification performance
over the 5794 test images with approximately 30 images for
each of the 200 categories.

[Figure 6 plots: Recall@10K, Sibling recall@10K and Weighted Sibling recall@10K on ILSVRC2010 versus number of bits, for ITQ, KSH and SHOE(Learn).]

Method          pre     Sib pre   Sib^w pre
SHOE(L)+CCA     24.5    32.6      29.1
KSH+CCA         24.8    30.4      28.0
ITQ+CCA         6.5     10.8      9.9

Method          mAP     Sib mAP   Sib^w mAP
SHOE(L)+CCA     0.039   0.021     0.022
KSH+CCA         0.036   0.010     0.014
ITQ+CCA         0.005   0.001     0.002

Figure 6. Retrieval on ILSVRC2010 dataset comparing SHOE with state-of-the-art hashing techniques. We use 5K training samples and
CNN+K+CCA as features for all the binary encoding schemes. The above plots report recall, Sib recall and Sib^w recall @10K for top 5 sibling classes
for bits c = {32, 64, 128, 256}. Table reports precision@50, mAP and their sibling versions for 256 bits.

Figure 7. The first query is of an ovenbird. SHOE retrieves more
ovenbirds than KSH. The second query is of a Brewer black-bird.
Neither SHOE nor KSH retrieve Brewer black-birds. However,
SHOE returns ravens, which are sibling classes of Brewer
black-birds, whereas KSH retrieves pileated woodpeckers, which are
unrelated to black-birds. Here, blue borders represent sibling classes.

Features: For each of the binary coding schemes (SHOE,
KSH, ITQ), we use the CNN+K+CCA (kernelized CNN
with CCA projections) features as input embeddings and
mean-centered attributes as the output embeddings. For the
experiments, we used only 128-bit codes, while CNN features
are 4096-dimensional vectors.

Results: We observe that not only do the proposed binary
codes obtain a marginal improvement in performance
over the complex classification models in [6] [31], but they
also offer a compression ratio of 1024. Also,
the training and testing times of binary coding schemes
are significantly smaller than those with SVM classification
models.

Method           top-1    top-5    sib      Compression
Baseline(SVM)    50.6     75.6     70.19    1
SHOE(L)+CCA      52.51    77.8     72.4     1024
KSH+CCA          52.48    75.1     69.06    1024
ITQ+CCA          27.5     43.4     37.6     1024
R-CNN[6]         51.5     -        -        1
Part-RCNN[31]    52.38    -        -        1

Table 2. Comparing classification accuracies for CUB dataset. For
top-1 and sibling accuracy, we used k = 10 neighbors. To obtain
top-5 accuracy, we used k = 50 neighbors. For the binary coding
schemes, we used only 5000 of the 5994 train images to obtain
128 bits, while the classification models are trained on the full set.
‘-’ indicates that the information is not available in their paper.
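
The compression ratio of 1024 follows from simple arithmetic, under the assumption (not stated in the text) that each CNN feature dimension is stored as a 32-bit float:

```python
cnn_dims = 4096          # dimensionality of the CNN features
bits_per_float = 32      # assumption: single-precision storage
code_bits = 128          # length of the binary codes used for classification

feature_bits = cnn_dims * bits_per_float      # 131072 bits per image
compression_ratio = feature_bits // code_bits
```

With these values the ratio works out to 131072 / 128 = 1024; a different storage precision would change the figure proportionally.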

6. Conclusion 

The key idea in our paper is to exploit output embeddings
that capture relationships between classes and to use
them to learn better hash functions for images. We devised
a method to learn class similarity jointly with the hash
function, along with new metrics for their evaluation.
Our method, SHOE, achieved state-of-the-art image
retrieval results over multiple datasets for hash codes of
varying lengths. Our second innovation was to utilize CCA
to learn a projection of features with output embeddings,
which resulted in significant gains in both retrieval and
classification experiments. Upon applying this approach to all
methods, we perform as well as or better than all baselines
over all datasets.